Finnish parliamentary elections 2015 in Helsinki (Eduskuntavaalit Helsingissä 2015)

Research Questions

1st Research Question: Is there a green-red voter bubble in Helsinki?

2nd Research Question: What regional characteristics predict, left-wing party, Vasemmistoliitto’s support?

Data Wrangling

To answer these questions I have built a dataset that consists of regional open data from Aluesarjat and Tilastokeskus. Constructing the dataset required couple thousand lines of R code. It would not be fair to put someone to evaluate that amount of R code, so I chose only to provide one file to illustrate what enormous amount of work the data wrangling was. In the wrangling script I introduced regional data which was later included into the final dataset. You can see the data wrangling file here: Data wrangling. I will not discuss the details of the dataset, but the most effort was needed to combine regional data to voting districts.

Explaining the Data

Before we get started with the analysis I want to introduce two Shiny applications which I programmed to visualize the data. Please take a look at Helsinki clusters app and Helsinki parties app. I used leaflet and plotlyto create dynamic web applications. The apps are still in development and I will add more features to them as I have more time.

Here is the list of variables which are included in the analysis:

  • aanestysalue_nro = Number of the voting district (may include several)
  • aanestysalue_nimi = Name of the voting district (may include several)
  • Kuluyks_keskiarvo2014 = Average income of a person living in the region
  • Kuluyks_mediaani2014 = Median income of a person living in the region
  • Gini2014 = Gini coefficient for the region
  • IP.p to VIHR.p = Vote share of a party in the region (IP = Itsenäisyyspuolue, VIHR = Vihreät)
  • Punavih.p = Vote share of green-reds (Vasemmistoliitto and Vihreät, VAS.p + VIHR.p)
  • Naisia_2015.p = Share of women living in the region
  • Ruotsinkielisia_2015.p = Share of Swedish speaking people in the region
  • Korkeakoulutus2015.p = Share of higher education of the people living in the region
  • Tyottomat2014.p = Unemployment percentage for the region
  • SuurituloisinX_2014.p = Share of people in the region belonging to the highest 10% income decile
  • Allemediaanitulot_2014.p = Share of people in the region belonging to the income deciles bellow median income
  • Nolla_17.p to Yli_65.p = Share of people having age of 0-17 in the region; other age groups are 18-29, 30-44, 45-64 and 65-
  • Suurinpuolue = Name of the party that has the largest support in the region
library(xlsx)
library(dplyr)
library(shiny)
df <- read.xlsx2("helsinki_election_2015.xlsx", startRow = 1, endRow = 106, sheetIndex = 1, header = T)

df <- df %>%
  dplyr::select(aanestysalue_nro, aanestysalue_nimi, Kuluyks_keskiarvo2014, Kuluyks_mediaani2014,Gini2014,IP.p:Punavih.p,Naisia_2015.p:Yli_65.p,Suurinpuolue)

min <- which(colnames(df) == "Kuluyks_keskiarvo2014")
max <- which(colnames(df) == "Kuluyks_mediaani2014")

df[,min:max] <- df[,min:max] %>%
  lapply(., function(x){as.numeric(as.character(x))})

min <- which(colnames(df) == "Gini2014")
max <- which(colnames(df) == "Yli_65.p")
for(x in min:max){
  df[,x] <- as.numeric(as.character(sub(",", ".", df[,x])))
}


str(df)
## 'data.frame':    105 obs. of  34 variables:
##  $ aanestysalue_nro        : Factor w/ 105 levels "10A, 10B","10C",..: 18 24 25 27 29 30 35 36 37 43 ...
##  $ aanestysalue_nimi       : Factor w/ 105 levels "Ala-Malmi A",..: 13 50 51 96 7 9 56 57 58 27 ...
##  $ Kuluyks_keskiarvo2014   : num  25494 24541 26016 30031 26849 ...
##  $ Kuluyks_mediaani2014    : num  23771 22092 23296 26456 24304 ...
##  $ Gini2014                : num  0.258 0.244 0.26 0.277 0.286 ...
##  $ IP.p                    : num  0.005 0.004 0.003 0.005 0.004 0.001 0 0.004 0.005 0.001 ...
##  $ KA.p                    : num  0.001 0.001 0.002 0.001 0 0.001 0 0.001 0 0.001 ...
##  $ KD.p                    : num  0.017 0.018 0.016 0.021 0.024 0.024 0.012 0.019 0.014 0.026 ...
##  $ KESK.p                  : num  0.05 0.074 0.066 0.083 0.086 0.095 0.073 0.069 0.084 0.076 ...
##  $ KOK.p                   : num  0.134 0.159 0.2 0.239 0.278 0.303 0.526 0.304 0.324 0.226 ...
##  $ KTP.p                   : num  0.001 0 0 0 0 0 0 0 0 0 ...
##  $ Muut.p                  : num  0.003 0.002 0.005 0.002 0.001 0.001 0 0.002 0.002 0.001 ...
##  $ Muutos2011.p            : num  0.001 0.002 0.002 0.001 0.002 0 0 0 0.001 0.004 ...
##  $ Piraattip..p            : num  0.022 0.016 0.015 0.011 0.018 0.013 0.01 0.012 0.015 0.016 ...
##  $ PS.p                    : num  0.101 0.152 0.129 0.14 0.076 0.092 0.051 0.078 0.076 0.14 ...
##  $ RKP.p                   : num  0.082 0.054 0.042 0.048 0.086 0.078 0.127 0.116 0.085 0.079 ...
##  $ SDP.p                   : num  0.147 0.193 0.19 0.169 0.11 0.103 0.051 0.125 0.114 0.204 ...
##  $ SKP.p                   : num  0.005 0.016 0.014 0.007 0.004 0.002 0 0.004 0.003 0.003 ...
##  $ STP.p                   : num  0.001 0.001 0.001 0 0 0 0 0 0 0.001 ...
##  $ VAS.p                   : num  0.191 0.125 0.125 0.104 0.086 0.076 0.03 0.067 0.058 0.087 ...
##  $ VIHR.p                  : num  0.241 0.183 0.192 0.169 0.224 0.209 0.12 0.2 0.215 0.134 ...
##  $ Punavih.p               : num  0.431 0.308 0.317 0.273 0.31 0.286 0.15 0.267 0.274 0.221 ...
##  $ Naisia_2015.p           : num  0.533 0.532 0.527 0.527 0.551 ...
##  $ Ruotsinkielisia_2015.p  : num  0.1066 0.0404 0.0416 0.0292 0.0832 ...
##  $ Korkeakoulutus2015.p    : num  0.378 0.309 0.33 0.407 0.452 ...
##  $ Tyottomat2014.p         : num  0.0972 0.1579 0.1453 0.1229 0.0956 ...
##  $ SuurituloisinX_2014.p   : num  0.0689 0.0603 0.0945 0.1322 0.0909 ...
##  $ Allemediaanitulot_2014.p: num  0.44 0.495 0.458 0.365 0.382 ...
##  $ Nolla_17.p              : num  0.129 0.154 0.157 0.218 0.116 ...
##  $ Kahdeksantoista_29.p    : num  0.275 0.151 0.162 0.148 0.251 ...
##  $ Kolmekymmenta_44.p      : num  0.264 0.193 0.192 0.224 0.253 ...
##  $ Neljakymmentaviisi_64.p : num  0.205 0.277 0.277 0.277 0.213 ...
##  $ Yli_65.p                : num  0.127 0.225 0.212 0.134 0.166 ...
##  $ Suurinpuolue            : Factor w/ 4 levels "KOK","PS","SDP",..: 4 3 1 1 1 1 1 1 1 1 ...
summary(df)
##       aanestysalue_nro                    aanestysalue_nimi
##  10A, 10B     : 1      Ala-Malmi A                 : 1     
##  10C          : 1      Ala-Malmi B                 : 1     
##  11A          : 1      Alppila A, Alppila B        : 1     
##  11B, 11C, 11D: 1      Alppila C, Alppila D        : 1     
##  11E, 11F     : 1      Eira                        : 1     
##  12A, 12B     : 1      Etelä-Haaga A, Etelä-Haaga B: 1     
##  (Other)      :99      (Other)                     :99     
##  Kuluyks_keskiarvo2014 Kuluyks_mediaani2014    Gini2014     
##  Min.   : 21005        Min.   :19565        Min.   :0.1920  
##  1st Qu.: 24801        1st Qu.:22578        1st Qu.:0.2505  
##  Median : 27927        Median :25032        Median :0.2712  
##  Mean   : 30885        Mean   :26012        Mean   :0.2943  
##  3rd Qu.: 32669        3rd Qu.:28523        3rd Qu.:0.3000  
##  Max.   :100732        Max.   :42395        Max.   :0.6305  
##                                                             
##       IP.p               KA.p               KD.p             KESK.p       
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.00600   Min.   :0.03600  
##  1st Qu.:0.002200   1st Qu.:0.000500   1st Qu.:0.01350   1st Qu.:0.05800  
##  Median :0.003200   Median :0.000900   Median :0.01820   Median :0.07460  
##  Mean   :0.003497   Mean   :0.000981   Mean   :0.01822   Mean   :0.07412  
##  3rd Qu.:0.004400   3rd Qu.:0.001300   3rd Qu.:0.02380   3rd Qu.:0.08940  
##  Max.   :0.010900   Max.   :0.004000   Max.   :0.03100   Max.   :0.13300  
##                                                                           
##      KOK.p            KTP.p               Muut.p        
##  Min.   :0.0891   Min.   :0.0000000   Min.   :0.000000  
##  1st Qu.:0.1632   1st Qu.:0.0000000   1st Qu.:0.001500  
##  Median :0.2420   Median :0.0000000   Median :0.002100  
##  Mean   :0.2568   Mean   :0.0001867   Mean   :0.002332  
##  3rd Qu.:0.3240   3rd Qu.:0.0002000   3rd Qu.:0.003000  
##  Max.   :0.5260   Max.   :0.0031000   Max.   :0.009500  
##                                                         
##   Muutos2011.p       Piraattip..p          PS.p            RKP.p        
##  Min.   :0.000000   Min.   :0.00400   Min.   :0.0345   Min.   :0.01500  
##  1st Qu.:0.000700   1st Qu.:0.01020   1st Qu.:0.0760   1st Qu.:0.03610  
##  Median :0.001100   Median :0.01300   Median :0.1130   Median :0.04800  
##  Mean   :0.001374   Mean   :0.01416   Mean   :0.1216   Mean   :0.06357  
##  3rd Qu.:0.002000   3rd Qu.:0.01700   3rd Qu.:0.1603   3rd Qu.:0.07800  
##  Max.   :0.006200   Max.   :0.03260   Max.   :0.2950   Max.   :0.23300  
##                                                                         
##      SDP.p            SKP.p             STP.p              VAS.p        
##  Min.   :0.0510   Min.   :0.00000   Min.   :0.000000   Min.   :0.02100  
##  1st Qu.:0.1161   1st Qu.:0.00250   1st Qu.:0.000000   1st Qu.:0.06810  
##  Median :0.1630   Median :0.00360   Median :0.000500   Median :0.08930  
##  Mean   :0.1603   Mean   :0.00431   Mean   :0.000699   Mean   :0.09724  
##  3rd Qu.:0.2040   3rd Qu.:0.00540   3rd Qu.:0.001000   3rd Qu.:0.11490  
##  Max.   :0.2984   Max.   :0.01600   Max.   :0.004600   Max.   :0.24090  
##                                                                         
##      VIHR.p         Punavih.p      Naisia_2015.p    Ruotsinkielisia_2015.p
##  Min.   :0.0760   Min.   :0.1010   Min.   :0.4756   Min.   :0.01530       
##  1st Qu.:0.1338   1st Qu.:0.2140   1st Qu.:0.5172   1st Qu.:0.03080       
##  Median :0.1690   Median :0.2628   Median :0.5285   Median :0.04020       
##  Mean   :0.1806   Mean   :0.2779   Mean   :0.5291   Mean   :0.05786       
##  3rd Qu.:0.2240   3rd Qu.:0.3222   3rd Qu.:0.5397   3rd Qu.:0.07070       
##  Max.   :0.3257   Max.   :0.5411   Max.   :0.5758   Max.   :0.21980       
##                                                                           
##  Korkeakoulutus2015.p Tyottomat2014.p  SuurituloisinX_2014.p
##  Min.   :0.1274       Min.   :0.0230   Min.   :0.0053       
##  1st Qu.:0.3117       1st Qu.:0.0822   1st Qu.:0.0526       
##  Median :0.3999       Median :0.1030   Median :0.0921       
##  Mean   :0.3959       Mean   :0.1101   Mean   :0.1187       
##  3rd Qu.:0.4676       3rd Qu.:0.1343   3rd Qu.:0.1473       
##  Max.   :0.6044       Max.   :0.2110   Max.   :0.4240       
##                                                             
##  Allemediaanitulot_2014.p   Nolla_17.p     Kahdeksantoista_29.p
##  Min.   :0.1291           Min.   :0.0361   Min.   :0.0836      
##  1st Qu.:0.2672           1st Qu.:0.1353   1st Qu.:0.1446      
##  Median :0.3816           Median :0.1700   Median :0.1821      
##  Mean   :0.3820           Mean   :0.1714   Mean   :0.1904      
##  3rd Qu.:0.4662           3rd Qu.:0.2055   3rd Qu.:0.2235      
##  Max.   :0.6161           Max.   :0.3214   Max.   :0.3868      
##                                                                
##  Kolmekymmenta_44.p Neljakymmentaviisi_64.p    Yli_65.p      Suurinpuolue
##  Min.   :0.1498     Min.   :0.1720          Min.   :0.0046   KOK :58     
##  1st Qu.:0.1890     1st Qu.:0.2172          1st Qu.:0.1352   PS  : 2     
##  Median :0.2159     Median :0.2576          Median :0.1658   SDP :24     
##  Mean   :0.2248     Mean   :0.2512          Mean   :0.1622   VIHR:21     
##  3rd Qu.:0.2532     3rd Qu.:0.2777          3rd Qu.:0.1907               
##  Max.   :0.3536     Max.   :0.3411          Max.   :0.2568               
## 

I strongly reccommed using my Shiny applications to explore the data, but for a quick overview, summary produces sufficient information. Kokoomus has the biggest share of support in Helsinki, SDP is the second by the ammount of won regions and Vihreät third, but Vihreät has higher median in vote share than SDP. So we can say that SDP’s support is more concentrated. There are big differences in the proportion of age groups, wealth, higher education, share of Swedish speaking people between the regions.

Let’s continue with the exploration of the data by doing correlation matrix with ggpairs function. Unluckily we have so many variables (34 of which 31 is used for analysis) that we cannot visualize them all in one plot and by splitting the columns into two parts does not include all combinations of variables.

library(GGally)
library(ggplot2)

my_custom_cor_color <- function(data, mapping, color = I("black"), sizeRange = c(1, 5), ...) {
  
  # get the x and y data to use the other code
  x <- eval(mapping$x, data)
  y <- eval(mapping$y, data)
  
  ct <- cor.test(x,y)
  
  r <- unname(ct$estimate)
  rt <- format(r, digits=2)[1]
  tt <- as.character(rt)
  
  # plot the cor value
  p <- ggally_text(
    label = tt, 
    mapping = aes(),
    xP = 0.5, yP = 0.5, 
    size = 6,
    color=color,
    ...
  ) +
    
    theme(
      panel.background=element_rect(fill="white", color = "black"),
      panel.grid.minor=element_blank(),
      panel.grid.major=element_blank()
    ) 
  
  corColors <- RColorBrewer::brewer.pal(n = 7, name = "RdYlGn")[2:6]
  
  if (r <= -0.8) {
    corCol <- corColors[1]
  } else if (r <= -0.6) {
    corCol <- corColors[2]
  } else if (r < 0.6) {
    corCol <- corColors[3]
  } else if (r < 0.8) {
    corCol <- corColors[4]
  } else {
    corCol <- corColors[5]
  }
  p <- p + theme(
    panel.background = element_rect(fill= corCol)
  )
  p
}

# sadly this hack does not work anymore to make the colors customly
myColors <- c("#0066FF","#4D4D4D","#FF3333", "#00B33C")
names(myColors) <- levels(df$Suurinpuolue)
colScale <- scale_colour_manual(name = "Suurinpuolue", values = myColors)

ggplot <- function(...) ggplot2::ggplot(...) + colScale
ggpairs(
  df[,3:length(df)],
    columns = 1:17, ## tata pitaa sit modaa
  mapping = ggplot2::aes(color = Suurinpuolue),
  upper = list(continuous = my_custom_cor_color),
  diag = list(combo = colScale),
  lower = list(
    combo = colScale
  )
)

ggplot <- function(...) ggplot2::ggplot(...) + colScale
ggpairs(
  df[,3:length(df)],
    columns = 18:31, ## tata pitaa sit modaa
  mapping = ggplot2::aes(color = Suurinpuolue),
  upper = list(continuous = my_custom_cor_color),
  diag = list(combo = colScale),
  lower = list(
    combo = colScale
  )
)

I tried to custom set the colours for ggpairs plots, as the data points are separated two four groups based on Suurinpuolue variable, but this proved not to be possible.

We can observe from the later correlation matrix that Punavih.p correlates strongly with young age, slightly positively with share of women, slightly negatively with high education and more negatively with share of people in the highest income decile. We can easily detect that VIHR.p has a mild positive correlation with high education (0.28), but because VAS.p has a stronger negative one the correlation between Punavih.p and higher education is around zero.

The Method RQ1

I am seeking an answer to the research question: Is there a green-red voter bubble in Helsinki?

To answer this I chose my statistical method to be PCA (principal component analysis) which suits well for multidimensional data, as PCA reduces the dimensions of the data to its main components. PCA seeks to find surfaces (in the data space) into which the projected data achieves the highest amount of variance. This ensures that the least amount of information of the data gets being lost.

PCA’s biplot illustration makes it possible to visualize connections between variables, components and the data. Our task is to visually inspect which variables contribute to which directions and what kind of interplay of variables can we determine.

My hypothesis is that there is a green-red bubble in which includes Kallio, Sörnäinen and Vallila regions. To validate the hypothesis we should expect to see variables from both Vasemmistoliitto and Vihreät to point towards the same direction. But more importantly their joint variable VAS+VIHR to point point towards the data points of Kallio, Sörnäinen and Vallilla.

The Results of RQ1

To perform the analysis we need to scale the variables first. This requires that we get rid of all non-numeric variables. Then I will rename the variables for readability of the biplot. Let’s take a look at the newly created scaled variables:

df_scaled <- df %>%
  dplyr::select(-aanestysalue_nro, -aanestysalue_nimi, -Suurinpuolue)
df_scaled <- scale(df_scaled)
summary(df_scaled)
##  Kuluyks_keskiarvo2014 Kuluyks_mediaani2014    Gini2014       
##  Min.   :-0.9293       Min.   :-1.4327      Min.   :-1.35950  
##  1st Qu.:-0.5723       1st Qu.:-0.7631      1st Qu.:-0.58224  
##  Median :-0.2782       Median :-0.2177      Median :-0.30721  
##  Mean   : 0.0000       Mean   : 0.0000      Mean   : 0.00000  
##  3rd Qu.: 0.1678       3rd Qu.: 0.5581      3rd Qu.: 0.07544  
##  Max.   : 6.5700       Max.   : 3.6408      Max.   : 4.46665  
##       IP.p              KA.p               KD.p          
##  Min.   :-1.7239   Min.   :-1.14047   Min.   :-2.008180  
##  1st Qu.:-0.6394   1st Qu.:-0.55916   1st Qu.:-0.775760  
##  Median :-0.1465   Median :-0.09412   Median :-0.003443  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.000000  
##  3rd Qu.: 0.4451   3rd Qu.: 0.37093   3rd Qu.: 0.916764  
##  Max.   : 3.6491   Max.   : 3.50998   Max.   : 2.099888  
##      KESK.p             KOK.p             KTP.p              Muut.p       
##  Min.   :-2.03549   Min.   :-1.6040   Min.   :-0.46749   Min.   :-1.4818  
##  1st Qu.:-0.86070   1st Qu.:-0.8952   1st Qu.:-0.46749   1st Qu.:-0.5288  
##  Median : 0.02573   Median :-0.1414   Median :-0.46749   Median :-0.1476  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.81605   3rd Qu.: 0.6429   3rd Qu.: 0.03339   3rd Qu.: 0.4241  
##  Max.   : 3.14427   Max.   : 2.5752   Max.   : 7.29621   Max.   : 4.5537  
##   Muutos2011.p      Piraattip..p          PS.p             RKP.p        
##  Min.   :-1.2464   Min.   :-1.8007   Min.   :-1.5670   Min.   :-1.1482  
##  1st Qu.:-0.6115   1st Qu.:-0.7017   1st Qu.:-0.8205   1st Qu.:-0.6494  
##  Median :-0.2488   Median :-0.2054   Median :-0.1550   Median :-0.3681  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5675   3rd Qu.: 0.5036   3rd Qu.: 0.6958   3rd Qu.: 0.3411  
##  Max.   : 4.3766   Max.   : 3.2687   Max.   : 3.1187   Max.   : 4.0053  
##      SDP.p              SKP.p             STP.p             VAS.p        
##  Min.   :-1.89884   Min.   :-1.5230   Min.   :-0.9094   Min.   :-1.6827  
##  1st Qu.:-0.76763   1st Qu.:-0.6395   1st Qu.:-0.9094   1st Qu.:-0.6432  
##  Median : 0.04733   Median :-0.2508   Median :-0.2589   Median :-0.1753  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.75977   3rd Qu.: 0.3854   3rd Qu.: 0.3915   3rd Qu.: 0.3897  
##  Max.   : 2.40011   Max.   : 4.1315   Max.   : 5.0749   Max.   : 3.1706  
##      VIHR.p          Punavih.p       Naisia_2015.p    
##  Min.   :-1.8364   Min.   :-1.9139   Min.   :-2.8622  
##  1st Qu.:-0.8216   1st Qu.:-0.6910   1st Qu.:-0.6368  
##  Median :-0.2037   Median :-0.1629   Median :-0.0323  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7619   3rd Qu.: 0.4799   3rd Qu.: 0.5669  
##  Max.   : 2.5473   Max.   : 2.8488   Max.   : 2.4981  
##  Ruotsinkielisia_2015.p Korkeakoulutus2015.p Tyottomat2014.p  
##  Min.   :-0.9937        Min.   :-2.47165     Min.   :-2.3395  
##  1st Qu.:-0.6318        1st Qu.:-0.77483     1st Qu.:-0.7486  
##  Median :-0.4123        Median : 0.03721     Median :-0.1896  
##  Mean   : 0.0000        Mean   : 0.00000     Mean   : 0.0000  
##  3rd Qu.: 0.2998        3rd Qu.: 0.66051     3rd Qu.: 0.6516  
##  Max.   : 3.7810        Max.   : 1.92001     Max.   : 2.7128  
##  SuurituloisinX_2014.p Allemediaanitulot_2014.p   Nolla_17.p      
##  Min.   :-1.2163       Min.   :-2.100604        Min.   :-2.42592  
##  1st Qu.:-0.7090       1st Qu.:-0.953722        1st Qu.:-0.64688  
##  Median :-0.2854       Median :-0.003662        Median :-0.02458  
##  Mean   : 0.0000       Mean   : 0.000000        Mean   : 0.00000  
##  3rd Qu.: 0.3067       3rd Qu.: 0.698917        3rd Qu.: 0.61207  
##  Max.   : 3.2743       Max.   : 1.943795        Max.   : 2.69061  
##  Kahdeksantoista_29.p Kolmekymmenta_44.p Neljakymmentaviisi_64.p
##  Min.   :-1.7318      Min.   :-1.7078    Min.   :-1.9778        
##  1st Qu.:-0.7426      1st Qu.:-0.8155    1st Qu.:-0.8495        
##  Median :-0.1344      Median :-0.2032    Median : 0.1590        
##  Mean   : 0.0000      Mean   : 0.0000    Mean   : 0.0000        
##  3rd Qu.: 0.5369      3rd Qu.: 0.6459    3rd Qu.: 0.6608        
##  Max.   : 3.1851      Max.   : 2.9313    Max.   : 2.2434        
##     Yli_65.p       
##  Min.   :-3.53969  
##  1st Qu.:-0.60589  
##  Median : 0.08151  
##  Mean   : 0.00000  
##  3rd Qu.: 0.64087  
##  Max.   : 2.12574

Next I will show the coefficients for the first three PCA components. There we can observe that age groups: 18-29v and 30-44v have the biggest coefficients in absolute terms in PC2 component. This is interesting because Vasemmistoliitto and Vihreät also have highly negatvie coefficients in PC2 component. I will talk more about the components and their relations with variables more after the biplot.

df_scaled  <- as.data.frame(df_scaled)

# tama pistaa nimet kuntoon
min <- which(colnames(df_scaled) == "Naisia_2015.p")
max <- which(colnames(df_scaled) == "Yli_65.p")
colnames(df_scaled)[min:max] <-  c("Naiset", "Ruotsinkieliset", "Korkeakoulutus", "Työttömyys", "Suurituloisin 10 %", "Pienituloisin 50%", "0-17v", "18-29v", "30-44v", "45-64v", "65v+", "Gini")
colnames(df_scaled)[1:3] <- c("Mediaanitulot", "Keskiarvotulot", "Gini")


# Renaming the variables for readability
colnames(df_scaled)[4:20] <-  c("IP", "KA", "KD", "Keskusta", "Kokoomus", "KTP", "Muut", "Muutos2011", "Piraatit", "Perussuomalaiset", "RKP", "SDP", "SKP", "STP", "Vasemmistoliitto", "Vihreät", "VAS+VIHR")

rownames(df_scaled) <- df$aanestysalue_nimi
pca_puolue <- prcomp(df_scaled)

s <- summary(pca_puolue)
s$rotation[,1:3]
##                             PC1         PC2         PC3
## Mediaanitulot       0.234399509  0.08310528  0.11800972
## Keskiarvotulot      0.254426626  0.12478256 -0.07826694
## Gini                0.234216646 -0.02315811  0.24763613
## IP                 -0.216130639  0.04254659  0.15039153
## KA                 -0.183561660  0.03531061  0.16377006
## KD                 -0.190397143  0.14799140  0.01934255
## Keskusta           -0.082342786  0.22135931 -0.19244349
## Kokoomus            0.260279581  0.10642263 -0.03960320
## KTP                -0.090437119  0.01860063  0.16705683
## Muut               -0.137805898  0.02188363  0.08208792
## Muutos2011         -0.165522083  0.08910878  0.03808296
## Piraatit           -0.140084657 -0.25372259 -0.03392582
## Perussuomalaiset   -0.225173061  0.18828384 -0.07444517
## RKP                 0.220206952 -0.01417146  0.30158756
## SDP                -0.246690882  0.13257410  0.05537124
## SKP                -0.185226917 -0.06944672  0.04055914
## STP                -0.145447276  0.02637016  0.15937758
## Vasemmistoliitto   -0.147679616 -0.26729525 -0.03222177
## Vihreät             0.036088345 -0.35476476 -0.05943911
## VAS+VIHR           -0.050385016 -0.34965018 -0.05292101
## Naiset             -0.008221606 -0.09073054  0.38532431
## Ruotsinkieliset     0.217229666 -0.02401998  0.32701094
## Korkeakoulutus      0.270241546 -0.05353298 -0.07056864
## Työttömyys         -0.241113640  0.02624027  0.25031155
## Suurituloisin 10 %  0.263706843  0.06567868  0.08614090
## Pienituloisin 50%  -0.252074803 -0.09713782  0.17575416
## 0-17v              -0.023470189  0.30424603 -0.21743284
## 18-29v             -0.035089973 -0.32855875 -0.01174990
## 30-44v              0.064634805 -0.32171061 -0.15152342
## 45-64v             -0.020168730  0.31313394 -0.04101381
## 65v+                0.032479295  0.10989205  0.47493891
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

biplot(pca_puolue, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2], xlim=c(-0.25, 0.25), ylim=c(-0.25, 0.25))
title(main = "PCA analysis of Helsinki voting regions")

As we can see from the biplot, there is a clear indication of a red-green bubble (punavihreä kupla). Both Vasemmistoliitto and Vihreät are pointing in the same general direction (highly negative in respect to the second PCA component), but Vasemmistoliitto points points more to the left and Vihreät slightly towards right. If we look at their joint variable VAS+VIHR, it is pointing directly towards the regions of Kallio, Vallilla and Alppila (also Sörnäinen).

We can deduct from the biplot that the second component mainly consits of age group variables and share of women if we are not looking at party variables. More negative the PC2 component, the more young people and women are located in the region. More positive the PC2 component, the more older population there is in the region.

On the other hand, PC1 seems to be strongly linked to socio-economic variables (income, unempolyment and education). We can observe Kokoomus (the right-wing party) pointing towards right (slightly upwards), thus being linked to old age and economic prosperity. On the right slightly downwards direction we have the Swedish speaking people, which seem to live in the same areas where also the share of higher education is high.

On the left side of the biplot we have the social problems of unemployment and share of bellow median income people living in the region. Populist parties (Perussuomalaiset) in combination with the social democrats (SDP) are pointing to that direction.

We can conclude that our first hypothesis was confirmed. There is clear statistical evidence of green-red bubble.

The Method RQ2

My second research qestion was: What regional characteristics predict, left-wing party, Vasemmistoliitto’s support?

To approach this question I will use LDA (linear discriminant analysis). LDA differs from PCA so that it is supervised learning mehtod, where as PCA is unsupervised. LDA, similarly than PCA, also looks for linear combinations which could explain the data. LDA method seeks to minimize the differences within groups and maximize the differences between groups.

To use LDA we will need a categorical variable. Therefore, I will cut Vasemmistoliitto party’s support into four sections (quantiles): low, med_low, med_high and high. I will only use the background variables as explanatory variables in the model, meaning that all voting results regarding other than Vasemmistoliitto will be excluded from the model.

I will divide my 105 observations into a training set (80% of the observations) and test set (20%). I will use the training set for learning my model and then test set will validate its performance. I will use a biplot to visualize my results.

The Results of RQ1

Next I will perform the analysis. We will notice that the results of the model will not be easily interpretable.

# analysis for the leftwing party
scaled_VAS <- df_scaled[, "Vasemmistoliitto"]
summary(scaled_VAS)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.6830 -0.6432 -0.1753  0.0000  0.3897  3.1710
vas <- as.numeric(cut_number(scaled_VAS,4)) %>%
  factor(labels = c("low", "med_low", "med_high", "high"))
table(vas)
## vas
##      low  med_low med_high     high 
##       27       26       26       26
min <- which(colnames(df_scaled) == "IP")
max <- which(colnames(df_scaled) == "VAS+VIHR")

df2_scaled <- dplyr::select(df_scaled, -Vasemmistoliitto, -(min:max)) # vain taustamuuttujat
df2_scaled <- data.frame(df2_scaled, vas)


set.seed(12)
ind <- nrow(df2_scaled) %>%
  sample(. * 0.8)

train <- df2_scaled[ind,]
test <- df2_scaled[-ind,]

correct_classes <- test[, "vas"]
# test set ready for prediction
test <- dplyr::select(test, -vas)
library(MASS)
lda.fit <- lda(vas ~. , data = train)
lda.fit
## Call:
## lda(vas ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2500000 0.2380952 0.2619048 
## 
## Group means:
##          Mediaanitulot Keskiarvotulot       Gini      Naiset
## low       0.8690290171     1.04516318  0.6222789 -0.05064286
## med_low  -0.0003225013     0.05519139  0.1507960  0.07698326
## med_high -0.5297775845    -0.70313406 -0.4731595  0.36330136
## high     -0.4980659953    -0.62563231 -0.3990084 -0.23218180
##          Ruotsinkieliset Korkeakoulutus Työttömyys Suurituloisin.10..
## low           0.96946441      0.8990153 -0.8506727          0.9427153
## med_low       0.08934603      0.1368221 -0.2584241          0.0941561
## med_high     -0.44196041     -0.7782810  0.7536965         -0.6358014
## high         -0.48638579     -0.4326277  0.4365850         -0.6012617
##          Pienituloisin.50.     X0.17v     X18.29v       X30.44v
## low             -0.9972227  0.4892705 -0.57066488 -0.2744735731
## med_low         -0.1753324 -0.3820600  0.07837955 -0.0008888253
## med_high         0.8160553  0.2203989  0.23196871 -0.2954531008
## high             0.6187012 -0.5526493  0.53028520  0.5532917476
##              X45.64v      X65v.
## low       0.24306291  0.2304170
## med_low   0.01364614  0.3586761
## med_high -0.05228444 -0.2594919
## high     -0.30834241 -0.3112001
## 
## Coefficients of linear discriminants:
##                              LD1         LD2         LD3
## Mediaanitulot         0.48873625   1.2687191   1.5819576
## Keskiarvotulot       -1.48325356  -1.0299530  -0.7587861
## Gini                  0.50763285  -1.1953653  -3.5491880
## Naiset                0.41992611   0.7281170  -0.6063731
## Ruotsinkieliset      -0.99711563  -0.3226375   0.9815081
## Korkeakoulutus       -0.59844039  -0.2592145   1.7460963
## Työttömyys            0.04274597   1.6410836   0.4759254
## Suurituloisin.10..    0.56013106   1.9459009   2.0710354
## Pienituloisin.50.     0.16366076  -1.2262569   2.1052261
## X0.17v             -108.48397045 -55.5743239 117.8974516
## X18.29v            -119.35884863 -61.7610059 130.4354907
## X30.44v             -84.34260500 -45.4237723  93.6319928
## X45.64v             -76.80112764 -41.6697259  85.1539016
## X65v.               -86.72455798 -45.9156593  94.1146122
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7561 0.1868 0.0571
library(ggbiplot)
vas_train <- train[,"vas"]
ggbiplot(lda.fit, obs.scale = 1, var.scale = 1,
  groups = vas_train, ellipse = TRUE, circle = TRUE, labels = row.names(train)) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top') + ggtitle("Biplot of Vasemmistoliito's support groups")

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        4       2        0    0
##   med_low    1       3        1    0
##   med_high   0       2        4    0
##   high       0       1        1    2

As you can see the results are a mess as the age group variables get very high coefficients ruining the interpretability of the model. The model does not perform that badly, but because the interplay of age groups (one age group increasing its share will make the space smaller for rest of the age groups as they sum up to 1). I will redo the analysis by replacing age variables with a new variable that is a ratio of 18-44 year olds to 45- onwards. This would hopefully solve the issue.

ikasuhde <- df %>%
  dplyr::mutate(nuoret_vanhat_suhde = (Kahdeksantoista_29.p + Kolmekymmenta_44.p) / (Neljakymmentaviisi_64.p + Yli_65.p))

ikasuhde <- ikasuhde %>%
  dplyr::select(nuoret_vanhat_suhde)

ikasuhde_scaled <- scale(ikasuhde)

min <- which(colnames(df2_scaled) == "X0.17v")
max <- which(colnames(df2_scaled) == "X65v.")
df3_scaled <- df2_scaled %>%
  dplyr::select(-(min:max))
df3_scaled["Nuoret-vanhat-suhde"] <- ikasuhde_scaled[,1]

# now redo analysis
set.seed(12)
ind <- nrow(df2_scaled) %>%
  sample(. * 0.8)

train <- df3_scaled[ind,]
test <- df3_scaled[-ind,]

correct_classes <- test[, "vas"]
# test set ready for prediction
test <- dplyr::select(test, -vas)
library(MASS)
lda.fit <- lda(vas ~. , data = train)
lda.fit
## Call:
## lda(vas ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2500000 0.2380952 0.2619048 
## 
## Group means:
##          Mediaanitulot Keskiarvotulot       Gini      Naiset
## low       0.8690290171     1.04516318  0.6222789 -0.05064286
## med_low  -0.0003225013     0.05519139  0.1507960  0.07698326
## med_high -0.5297775845    -0.70313406 -0.4731595  0.36330136
## high     -0.4980659953    -0.62563231 -0.3990084 -0.23218180
##          Ruotsinkieliset Korkeakoulutus Työttömyys Suurituloisin.10..
## low           0.96946441      0.8990153 -0.8506727          0.9427153
## med_low       0.08934603      0.1368221 -0.2584241          0.0941561
## med_high     -0.44196041     -0.7782810  0.7536965         -0.6358014
## high         -0.48638579     -0.4326277  0.4365850         -0.6012617
##          Pienituloisin.50. `Nuoret-vanhat-suhde`
## low             -0.9972227           -0.31583562
## med_low         -0.1753324           -0.13253126
## med_high         0.8160553            0.04993443
## high             0.6187012            0.42835446
## 
## Coefficients of linear discriminants:
##                               LD1        LD2         LD3
## Mediaanitulot         -0.23121990  1.6182308  0.83179678
## Keskiarvotulot        -0.05226718 -0.2543018 -0.01458681
## Gini                   1.60409872 -3.6449498 -1.95169677
## Naiset                 0.18180171  1.0370667 -1.10309254
## Ruotsinkieliset       -0.97399476  0.4805874  0.73548781
## Korkeakoulutus        -0.20163417 -1.6883704  2.84990125
## Työttömyys             0.77181436  1.0114733  0.70652848
## Suurituloisin.10..    -1.17154171  3.7195498  0.08768411
## Pienituloisin.50.     -0.25284737 -0.2826991  2.10547380
## `Nuoret-vanhat-suhde`  0.61033463  1.3087730 -0.08191412
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.7919 0.1365 0.0716
library(ggbiplot)
vas_train <- train[,"vas"]
ggbiplot(lda.fit, obs.scale = 1, var.scale = 1,
  groups = vas_train, ellipse = TRUE, circle = TRUE, labels = row.names(train)) +
  scale_color_discrete(name = '') +
  theme(legend.direction = 'horizontal', legend.position = 'top') + ggtitle("Biplot of Vasemmistoliito's support groups")

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low        3       3        0    0
##   med_low    1       3        1    0
##   med_high   0       1        4    1
##   high       0       1        1    2

Now we can see interpret the results much better. The results are a bit different LD1 has now a higher share of the captured trace. The better interpretability comes up with a price of misclassifying one observation more in the updated model than in the first one.

If we look at the second biplot, we can see that Gini is interstingly pointing towards high support for Vasemmistoliitto. If you go to my helsinkiclusters-app you can see that the data clearly contradicts this: higher the gini coeffcient, lower the support for Vasemmistoliitto. Also if we inspect the group means we can see clear negative relation to high and med_high support with gini coefficient. Share of women is difficult to interpret as high share of women is connected to med_high support, but not to high support. This means that Vasemmistoliitto’s voter base is diverse.

If we look at the biplot, we can see that there is a signifficant overlap between med_low, med_high and high support groups. Therefore, it is very difficult to create simple rules of thumb to say which variable leads certainly to high support for Vasemmistoliitto. We can say which have a negative effect more easily. The share of Swedish speaking people and high share of people belonging into the highest decile in income distribution certainly affect negatively to Vasemmistoliitto’s support. Interestingly we can find evidence that high share of small income people also predict diminished support for Vasemmistoliitto (becuse SDP and Perussuomalaiset are high on those regions).

Conclusions and Discussion

We can conclude that our hypothesis for the first research question was confirmed. There is clear statistical evidence of green-red bubble in Kallio, Sörnäinen and Vallilla with some other nearby regions. These regions are mainly inhabited by young people and high share of women.

For the second research question it is hard to give a simple answer, but we can see from the data that high levels of income, education and share of Swedish speaking population affect negatively to Vasemmistoliitto’s support. Higher support is linked to unemployment, high ratio of young people over old in the region. Vasemmistoliitto appears to have a diverse voter as most left-wing parties have: the working class wing (mainly old working people with little education) and academic wing (mainly young). Both groups wings are joined by low income. With variables in the dataset Vasemmistoliitto’s support is not very easily easily classifiable. The accuracy might be better if there was a variable indicating students attending higher levels of education (not just completed higher levels of education).